1 The forecast problem.

The problem is to forecast a time series. In particular, the time series is the Beer, Wine, and Distilled Alcoholic Beverages Sales as in the original Matt Dancho’s example. The data is taken from FRED (Federal Reserve Economic Data). The data belongs to the non-durable goods category, it includes U.S. merchant wholesalers, except manufacturers’ sales branches and offices sales. The monthly time series goes from 2010-01-01 to 2022-10-31. And the goal is to use 2022 data (10 months) as a test data to conduct the forecast.

For the full database details see: https://fred.stlouisfed.org/series/S4248SM144NCEN

1.1 The data.

Let’s load the R packages.

# Load libraries
library(fpp3)
library(h2o)        # ML Library.
library(timetk)     # Toolkit for working with time series in R.
library(tidyquant)  # Loads tidyverse, financial pkgs, used to get data.
library(dplyr)      # Database manipulation.
library(ggplot2)    # Plots.
library(tibble)     # Tables.
library(kableExtra) # Tables.
library(knitr)      
library(bit64)      # Useful in the machine learning workflow.
library(sweep)      # Broom-style tidiers for the forecast package.
library(forecast)   # Forecasting models and predictions package.
library(seasonal)
library(tictoc)

We can conveniently download the data directly from the FRED API in one line of code.

# Beer, Wine, Distilled Alcoholic Beverages, in Millions USD.
beer <- tq_get("S4248SM144NCEN", get = "economic.data", 
               from = "2010-01-01", to = "2022-10-31")

Let’s have a look of the data set. By default it says price, but these are sales figures in monetary terms. According to the main FRED reference, these are in millions of dollars, not seasonally adjusted.

head(beer)
## # A tibble: 6 × 3
##   symbol         date       price
##   <chr>          <date>     <int>
## 1 S4248SM144NCEN 2010-01-01  6558
## 2 S4248SM144NCEN 2010-02-01  7481
## 3 S4248SM144NCEN 2010-03-01  9475
## 4 S4248SM144NCEN 2010-04-01  9424
## 5 S4248SM144NCEN 2010-05-01  9351
## 6 S4248SM144NCEN 2010-06-01 10552

We can change the name of the price column.

beer <- beer %>%
  rename(sales = price)

tail(beer)
## # A tibble: 6 × 3
##   symbol         date       sales
##   <chr>          <date>     <int>
## 1 S4248SM144NCEN 2022-05-01 16600
## 2 S4248SM144NCEN 2022-06-01 17700
## 3 S4248SM144NCEN 2022-07-01 15031
## 4 S4248SM144NCEN 2022-08-01 16860
## 5 S4248SM144NCEN 2022-09-01 16305
## 6 S4248SM144NCEN 2022-10-01 15418

Better now.

Visualization is particularly important for time series analysis and forecasting. It’s a good idea to identify spots where we will split the data into training and test. This kind of split is consistent with most machine learning algorithms. The training dataset is the sample of data used to fit and train the model by learning from the data. The test dataset is the sample of data used to provide an unbiased evaluation of a final model fit on the training dataset. The test dataset provides the gold standard used to evaluate the model. It is only used once a model is completely trained. The test set is generally what is used to evaluate competing models.

It is also important to see the time series because normally the models will perform better if we can identify basic characteristics such as trend and seasonality. This data set clearly has a trend and a seasonality as people drink more alcohol in December.

beer %>%
  ggplot(aes(date, sales)) +
  # Train Region:
  annotate("text", x = ymd("2013-01-01"), y = 14000, 
           color = "black", label = "Train region") +
  geom_rect(xmin = as.numeric(ymd("2022-01-01")), 
            xmax = as.numeric(ymd("2022-09-30")), ymin = 0, ymax = 20000, 
            alpha = 0.02, fill = "pink") +
  annotate("text", x = ymd("2022-06-01"), y = 9000,
           color = "black", label = "Test\nregion") +
  # Data.
  geom_line(col = "black") +
  geom_point(col = "black", alpha = 0.5, size = 2) +
  # Aesthetics.
  theme_tq() +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(subtitle = 
  "Train (2010 - 2021), and test set (Jan 2022 to Oct 2022)",
  x = "Date", y = "Sales",
       caption = "The models do not know the test region, this is for us 
       to see how well the models do the 10-month ahead forecast.") 
Beer, Wine, and Distilled Alcoholic Beverages Sales.

Figure 1.1: Beer, Wine, and Distilled Alcoholic Beverages Sales.

Then, the problem is to forecast the 10 months of the test region. This is, from January to October 2022.

Here is a zoom version of the plot above.

beer %>%
  filter(date > as.Date("2020-01-01")) %>%
  ggplot(aes(date, sales)) +
  annotate("text", x = ymd("2020-08-01"), y = 17000, 
           color = "black", label = "Train region") +
  geom_rect(xmin = as.numeric(ymd("2022-01-01")), 
            xmax = as.numeric(ymd("2022-09-30")), ymin = 0, ymax = 20000, 
            alpha = 0.02, fill = "pink") +
  annotate("text", x = ymd("2022-05-01"), y = 14000,
           color = "black", label = "Test region") +
  geom_line(col = "black") +
  geom_point(col = "black", alpha = 0.5, size = 5) +
  theme_tq() +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(subtitle = 
  "Train (2010 - 2021) and test set (Jan 2022 to Oct 2022)",
  x = "Date", y = "Sales",
       caption = "The models do not know the test region, this is for us 
       to see how well the models do the 10-month ahead forecast.") 
Zoom: Beer, Wine, and Distilled Alcoholic Beverages Sales.

Figure 1.2: Zoom: Beer, Wine, and Distilled Alcoholic Beverages Sales.

1.2 Time series properties.

The forecasting techniques are expected to exploit the time-series components like trend and seasonal component. Here we use the Hyndman and Athanasopoulos (2021) fpp3 package to learn about the time series properties before conducting the forecast techniques. In order to use the fpp3 package, we have to transform beer from a tibble to a tsibble object.

beer_tbls <- beer
beer_tbls$date <- yearmonth(beer_tbls$date)
beer_tbls <- as_tsibble(beer_tbls)

According to Hyndman and Athanasopoulos (2021), the X-11 method was originated in the US Census Bureau and further developed by Statistics Canada. The decomposition process tends to be highly robust to outliers and level shifts in the time series. The details of the X-11 method are described in Dagum and Bianconcini (2016).

beer_tbls %>%
  model(x11 = X_13ARIMA_SEATS(sales ~ x11())) %>%
  components() %>%
  autoplot() +
  labs(y = "Sales", x = "Date")
A multiplicative decomposition of beer sales using X-11.

Figure 1.3: A multiplicative decomposition of beer sales using X-11.

The trend shows a clear change in the first half of 2020. Let’s take a look of it.

beer_tbls %>%
  model(x11 = X_13ARIMA_SEATS(sales ~ x11())) %>%
  components() %>%
  select(date, sales, trend) %>%
  filter_index("2020-03" ~ "2020-10") %>%
  head(8)
## # A tsibble: 8 x 3 [1M]
##        date sales  trend
##       <mth> <dbl>  <dbl>
## 1 2020 mar. 13381 13711.
## 2 2020 abr. 12232 13794.
## 3 2020 may. 13997 13879.
## 4 2020 jun. 16112 13951.
## 5 2020 jul. 15670 15356.
## 6 2020 ago. 15281 15363.
## 7 2020 sep. 15506 15340.
## 8 2020 oct. 16188 15289.

The consumption trend significantly increased from June 2020 to July 2020. It is big enough to create a discontinuity in the trend plot below.

beer_tbls %>%
  model(x11 = X_13ARIMA_SEATS(sales ~ x11())) %>%
  components() %>%
  select(date, trend) %>%
  ggplot(aes(date, trend)) + 
  geom_point(alpha = 0.3, size = 4) +
  geom_point(aes(x = yearmonth("2020 jun."), y = 13950.99), 
             alpha = 0.3, size = 4, colour = "red") +
  geom_point(aes(x = yearmonth("2020 jul."), y = 15356.13), 
             alpha = 0.3, size = 4, colour = "red") +
  labs(y = "Trend", x = "Date")
Trend discontinuity.

Figure 1.4: Trend discontinuity.

There are also three negative spikes in the irregular component.

beer_tbls %>%
  model(x11 = X_13ARIMA_SEATS(sales ~ x11())) %>%
  components() %>%
  select(date, sales, irregular) %>%
  filter(irregular < 0.95) %>%
  head(3)
## # A tsibble: 3 x 3 [1M]
##        date sales irregular
##       <mth> <dbl>     <dbl>
## 1 2020 abr. 12232     0.889
## 2 2020 dic. 16335     0.869
## 3 2021 dic. 18139     0.910
beer_tbls %>%
  model(x11 = X_13ARIMA_SEATS(sales ~ x11())) %>%
  components() %>%
  select(date, irregular) %>%
  ggplot(aes(date, irregular)) + 
  geom_point(alpha = 0.3, size = 4) +
  geom_point(aes(x = yearmonth("2020 apr."), y = 0.8894670), 
             alpha = 0.3, size = 4, colour = "red") +
  geom_point(aes(x = yearmonth("2020 dec."), y = 0.8688517), 
             alpha = 0.3, size = 4, colour = "red") +
  geom_point(aes(x = yearmonth("2021 dec."), y = 0.9099203), 
             alpha = 0.3, size = 4, colour = "red") +
  labs(y = "Irregular", x = "Date")
Three negative spikes in the irregular component.

Figure 1.5: Three negative spikes in the irregular component.

Apparently, these trend and irregular events are consistent independently of the decomposition technique. We implement the SEATS decomposition below. SEATS stands for Seasonal Extraction in ARIMA Time Series. According to Hyndman and Athanasopoulos (2021), this procedure was developed at the Bank of Spain, and is now widely used by government agencies around the world. See Dagum and Bianconcini (2016) for further details.

beer_tbls %>%
  model(seats = X_13ARIMA_SEATS(sales ~ seats())) %>%
  components() %>%
  autoplot() +
  labs(y = "Sales", x = "Date")
A decomposition of beer sales obtained using SEATS.

Figure 1.6: A decomposition of beer sales obtained using SEATS.

A seasonal plot is similar to a time plot except that the data are plotted against the individual seasons, in this case months, in which the data were observed.

beer_tbls %>%
  gg_season(sales, labels = "both") +
  labs(y = "Sales", x = "Date")
Seasonal plot: Beer sales.

Figure 1.7: Seasonal plot: Beer sales.

And this is the zoom version of the plot above.

beer_tbls %>%
  filter_index("2019-01" ~ .) %>%
  gg_season(sales, labels = "both", size = 2) +
  geom_point() +
  labs(y = "Sales", x = "Date")
Zoom. Seasonal plot: Beer sales.

Figure 1.8: Zoom. Seasonal plot: Beer sales.

An alternative plot that emphasises the seasonal patterns is where the data for each month are collected together in separate mini time plots.

beer_tbls %>%
  gg_subseries(sales) +
  labs(y = "Sales", x = "Date")
Seasonal subseries plot of monthly beer sales.

Figure 1.9: Seasonal subseries plot of monthly beer sales.

December is the month of the year with the highest average sales, followed by June. January is the month of the year with the lowest average sales, followed by February.

Following Hyndman and Athanasopoulos (2021), just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series. When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in value. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase. When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal period, in this case 12) than for other lags.

beer_tbls %>%
  ACF(sales, lag_max = 60) %>%
  autoplot()
ACF of monthly beer sales.

Figure 1.10: ACF of monthly beer sales.

Beer sales is both trended and seasonal. The slow decrease in the ACF as the lags increase is due to the trend, while the scalloped shape is due to the seasonality.

2 Forecasts with fpp3.

Here we implement some selected forecast techniques using the Hyndman and Athanasopoulos (2021) fpp3 package. First define the training and test set (2022).

beer_train <- beer_tbls %>%
  select(date, sales) %>%
  filter_index(. ~ "2021-12")

beer_2022 <- beer_tbls %>%
  select(date, sales) %>%
  filter_index("2022-01" ~ .)

2.1 Four simple techniques.

Let’s estimate four simple forecast techniques. Mean, where the forecasts of all future values are equal to the average of the historical data. Naïve, we set all forecasts to be the value of the last observation. Seasonal naïve, we set each forecast to be equal to the last observed value from the same season. And drift, to allow the forecasts to increase or decrease over time.

Estimate the four models.

beer_fit <- beer_train %>%
  model(Mean = MEAN(sales), `Naïve` = NAIVE(sales),
    `Seasonal naïve` = SNAIVE(sales), Drift = RW(sales ~ drift()))
glance(beer_fit)
## # A tibble: 4 × 2
##   .model           sigma2
##   <chr>             <dbl>
## 1 Mean           5439796.
## 2 Naïve          3267385.
## 3 Seasonal naïve  481887.
## 4 Drift          3267385.

The 10-month forecasts.

beer_fc <- beer_fit %>%
  fabletools::forecast(h = "10 months")

Let’s compute the MAPE of all forecasts.

mape_table <- fabletools::accuracy(beer_fc, beer_2022) %>%
  select(.model, MAPE) %>%
  arrange(desc(-MAPE))
mape_table
## # A tibble: 4 × 2
##   .model          MAPE
##   <chr>          <dbl>
## 1 Seasonal naïve  3.76
## 2 Naïve          18.6 
## 3 Drift          21.4 
## 4 Mean           22.9
sn_mape <- fabletools::accuracy(beer_fc, beer_2022) %>%
  filter(.model == "Seasonal naïve") %>%
  select(MAPE) %>%
  unlist()
sn_mape
##     MAPE 
## 3.756953

Let’s plot the forecast results.

beer_fc %>%
  autoplot(beer_tbls, level = NULL) +
  geom_vline(xintercept = as.Date("2022-01-01"), lty = 2) +
  labs(y = "Sales", x = "Date") +
  guides(colour = guide_legend(title = "Forecast:")) +
  theme(legend.position = "bottom")
Four simple forecasts.

Figure 2.1: Four simple forecasts.

The plot above is not very clear. Here is a zoom version.

beer_zoom <- beer_tbls %>%
  select(date, sales) %>%
  filter_index("2019-12" ~ .)

beer_fc %>%
  autoplot(beer_zoom, level = NULL, lwd = 2)  +
  geom_vline(xintercept = as.Date("2022-01-01"), lty = 2) +
  labs(y = "Sales", x = "Date") +
  guides(colour = guide_legend(title = "Forecast:")) +
  theme(legend.position = "bottom")
Zoom: Four simple forecasts.

Figure 2.2: Zoom: Four simple forecasts.

This is the seasonal naïve forecasts.

beer_zoom <- beer_tbls %>%
  select(date, sales) %>%
  filter_index("2019-12" ~ .)

beer_sn_fc <- beer_fc %>%
  filter(.model == "Seasonal naïve") 

  ggplot(beer_zoom, aes(date, sales), lwd = 2, alpha = 0.4) +
  geom_line() +
  geom_point(size = 5, color = "black", alpha = 0.5, 
             shape = 21, fill = "black") +
  geom_point(aes(y = .mean), size = 5, 
             color = "red", alpha = 0.5, shape = 21, 
             fill = "red", data = beer_sn_fc) +
  geom_line(aes(y = .mean), color = "red", size = 0.5, data = beer_sn_fc) +
  geom_vline(xintercept = as.Date("2022-01-01"), lty = 2) +
  labs(y = "Sales", x = "Date",
       caption = c(paste("MAPE=",(round(sn_mape, 5))))) +
  theme(legend.position = "bottom")
Zoom: Seasonal naïve forecasts.

Figure 2.3: Zoom: Seasonal naïve forecasts.

The simple techniques are not necessarily bad techniques.

2.2 Exponential smoothing.

According to Hyndman and Athanasopoulos (2021), forecasts produced using exponential smoothing methods are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, the more recent the observation the higher the associated weight. This framework generates reliable forecasts quickly and for a wide range of time series, which is a great advantage and of major importance to applications in industry.

In this subsection, we let the ETS() function select the model by minimising the AICc.

beer_ets <- beer_train %>%
  model(ETS(sales))
report(beer_ets)
## Series: sales 
## Model: ETS(M,A,M) 
##   Smoothing parameters:
##     alpha = 0.136482 
##     beta  = 0.007637814 
##     gamma = 0.0001003809 
## 
##   Initial states:
##      l[0]     b[0]     s[0]    s[-1]    s[-2]     s[-3]    s[-4]     s[-5]
##  9238.444 36.54929 1.166739 1.029367 1.035279 0.9918264 1.049006 0.9925107
##     s[-6]    s[-7]     s[-8]     s[-9]    s[-10]    s[-11]
##  1.127315 1.065248 0.9630721 0.9828719 0.8356762 0.7610889
## 
##   sigma^2:  0.0021
## 
##      AIC     AICc      BIC 
## 2535.902 2540.759 2586.388

The ETS(M,A,M) corresponds to a Holt-Winters multiplicative method with multiplicative errors for when seasonal variations are changing proportional to the level of the series.

components(beer_ets) %>%
  autoplot() +
  labs(x = "Date", y = "Sales")
ETS components.

Figure 2.4: ETS components.

The ETS(M,A,M) 10-month forecast.

beer_ets_fc <- beer_ets %>%
  fabletools::forecast(h = 10)
beer_ets_fc
## # A fable: 10 x 4 [1M]
## # Key:     .model [1]
##    .model          date            sales  .mean
##    <chr>          <mth>           <dist>  <dbl>
##  1 ETS(sales) 2022 ene.  N(12160, 3e+05) 12160.
##  2 ETS(sales) 2022 feb. N(13424, 378560) 13424.
##  3 ETS(sales) 2022 mar. N(15874, 541048) 15874.
##  4 ETS(sales) 2022 abr. N(15637, 537474) 15637.
##  5 ETS(sales) 2022 may. N(17389, 681269) 17389.
##  6 ETS(sales) 2022 jun. N(18500, 791534) 18500.
##  7 ETS(sales) 2022 jul. N(16373, 637298) 16373.
##  8 ETS(sales) 2022 ago. N(17396, 740347) 17396.
##  9 ETS(sales) 2022 sep. N(16534, 689031) 16534.
## 10 ETS(sales) 2022 oct. N(17348, 782364) 17348.

The ETS(M,A,M) MAPE.

ets_mape <- fabletools::accuracy(beer_ets_fc, beer_2022) %>%
  select(MAPE) %>%
  unlist()
ets_mape
##     MAPE 
## 4.009512

Let’s see the ETS(M,A,M) forecast.

beer_ets %>%
  fabletools::forecast(h = 10) %>%
  autoplot(beer_tbls) +
  geom_vline(xintercept = as.Date("2022-01-01"), lty = 2) +
  labs(x = "Date", y = "Sales") +
  theme(legend.position = "bottom")
ETS(M,A,M) forecast.

Figure 2.5: ETS(M,A,M) forecast.

This is not very clear, here is the zoom version.

  ggplot(beer_zoom, aes(date, sales), lwd = 2, alpha = 0.4) +
  geom_line() +
  geom_point(size = 5, color = "black", alpha = 0.5, 
             shape = 21, fill = "black") +
  geom_point(aes(y = .mean), size = 5, 
             color = "red", alpha = 0.5, shape = 21, 
             fill = "red", data = beer_ets_fc) +
  geom_line(aes(y = .mean), color = "red", size = 0.5, data = beer_ets_fc) +
  geom_vline(xintercept = as.Date("2022-01-01"), lty = 2) +
  labs(x = "Date", y = "Sales",
       caption = c(paste("MAPE=",(round(ets_mape, 5))))) +
  theme(legend.position = "bottom")
Zoom: ETS(M,A,M) forecast.

Figure 2.6: Zoom: ETS(M,A,M) forecast.

Update the MAPE table.

mape_updated <- mape_table %>%
  add_row(.model = "ETS(M,A,M)", MAPE = ets_mape) %>%
  arrange(desc(-MAPE))
mape_updated
## # A tibble: 5 × 2
##   .model          MAPE
##   <chr>          <dbl>
## 1 Seasonal naïve  3.76
## 2 ETS(M,A,M)      4.01
## 3 Naïve          18.6 
## 4 Drift          21.4 
## 5 Mean           22.9

2.3 ARIMA.

According to Hyndman and Athanasopoulos (2021), while exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data. The ARIMA() function combines unit root tests, minimisation of the AICc and MLE to obtain an ARIMA model. By setting stepwise = FALSE and approximation = FALSE, we are making R work extra hard to find a good model.

beer_arima <- beer_train %>%
  model(arima_auto = ARIMA(sales, stepwise = FALSE, approx = FALSE))
report(beer_arima)
## Series: sales 
## Model: ARIMA(4,1,1)(0,1,1)[12] 
## 
## Coefficients:
##           ar1      ar2     ar3      ar4      ma1     sma1
##       -0.0613  -0.0096  0.2284  -0.3343  -0.7837  -0.6760
## s.e.   0.1048   0.1037  0.0959   0.0846   0.0802   0.1013
## 
## sigma^2 estimated as 310397:  log likelihood=-1015.85
## AIC=2045.7   AICc=2046.61   BIC=2065.82

The residuals for the best ARIMA model.

beer_arima %>%
  select(arima_auto) %>%
  gg_tsresiduals()
Residuals from the fitted ARIMA(4,1,1)(0,1,1)[12] model.

Figure 2.7: Residuals from the fitted ARIMA(4,1,1)(0,1,1)[12] model.

The forecast for the best ARIMA model.

beer_arima_fc <- beer_arima %>%
  fabletools::forecast(h = "10 months")
beer_arima_fc
## # A fable: 10 x 4 [1M]
## # Key:     .model [1]
##    .model          date            sales  .mean
##    <chr>          <mth>           <dist>  <dbl>
##  1 arima_auto 2022 ene. N(12585, 310421) 12585.
##  2 arima_auto 2022 feb. N(14240, 317880) 14240.
##  3 arima_auto 2022 mar. N(15719, 329948) 15719.
##  4 arima_auto 2022 abr. N(15565, 387641) 15565.
##  5 arima_auto 2022 may. N(16969, 391461) 16969.
##  6 arima_auto 2022 jun. N(17924, 405432) 17924.
##  7 arima_auto 2022 jul. N(16682, 422853) 16682.
##  8 arima_auto 2022 ago. N(17260, 423137) 17260.
##  9 arima_auto 2022 sep. N(16507, 450648) 16507.
## 10 arima_auto 2022 oct. N(17270, 460810) 17270.

The best ARIMA MAPE.

arima_mape <- fabletools::accuracy(beer_arima_fc, beer_2022) %>%
  select(MAPE) %>%
  unlist()
arima_mape
##     MAPE 
## 4.529671

Let’s plot the forecast results.

beer_arima_fc %>%
  autoplot(beer_tbls, level = NULL) +
  geom_vline(xintercept = as.Date("2022-01-01"), lty = 2) +
  labs(y = "Sales", x = "Date") 
ARIMA(4,1,1)(0,1,1)[12] forecast.

Figure 2.8: ARIMA(4,1,1)(0,1,1)[12] forecast.

The plot above is not very clear. Here is a zoom version.

  ggplot(beer_zoom, aes(date, sales), lwd = 2, alpha = 0.4) +
  geom_line() +
  geom_point(size = 5, color = "black", alpha = 0.5, 
             shape = 21, fill = "black") +
  geom_point(aes(y = .mean), size = 5, 
             color = "red", alpha = 0.5, shape = 21, 
             fill = "red", data = beer_arima_fc) +
  geom_line(aes(y = .mean), color = "red", size = 0.5, data = beer_arima_fc) +
  geom_vline(xintercept = as.Date("2022-01-01"), lty = 2) +
  labs(x = "Date", y = "Sales",
       caption = c(paste("MAPE=",(round(arima_mape, 5))))) +
  theme(legend.position = "bottom")
Zoom. ARIMA(4,1,1)(0,1,1)[12] forecast.

Figure 2.9: Zoom. ARIMA(4,1,1)(0,1,1)[12] forecast.

Update the MAPE table.

mape_updated <- mape_updated %>%
  add_row(.model = "ARIMA(4,1,1)(0,1,1)[12]", MAPE = arima_mape) %>%
  arrange(desc(-MAPE))
mape_updated
## # A tibble: 6 × 2
##   .model                   MAPE
##   <chr>                   <dbl>
## 1 Seasonal naïve           3.76
## 2 ETS(M,A,M)               4.01
## 3 ARIMA(4,1,1)(0,1,1)[12]  4.53
## 4 Naïve                   18.6 
## 5 Drift                   21.4 
## 6 Mean                    22.9

2.4 Neural network.

Artificial neural networks are forecasting methods that are based on simple mathematical models of the brain. They allow complex nonlinear relationships between the response variable and its predictors. With time series data, lagged values of the time series can be used as inputs to a neural network, just as we used lagged values in a linear autoregression model. We call this a neural network autoregression or NNAR model.

beer_nnet <- beer_train %>%
  model(NNETAR(sales))
report(beer_nnet)
## Series: sales 
## Model: NNAR(15,1,8)[12] 
## 
## Average of 20 networks, each of which is
## a 15-8-1 network with 137 weights
## options were - linear output units 
## 
## sigma^2 estimated as 727

The NNAR(15,1,8)[12] model has inputs \(y_{t−1}, y_{t−2},..., y_{t−15}\) and 8 neurons in the hidden layer.

The forecast for the best NNAR model.

beer_nnet_fc <- beer_nnet %>%
  fabletools::forecast(h = "10 months")
beer_nnet_fc
## # A fable: 10 x 4 [1M]
## # Key:     .model [1]
##    .model             date        sales  .mean
##    <chr>             <mth>       <dist>  <dbl>
##  1 NNETAR(sales) 2022 ene. sample[5000] 13222.
##  2 NNETAR(sales) 2022 feb. sample[5000] 14127.
##  3 NNETAR(sales) 2022 mar. sample[5000] 16521.
##  4 NNETAR(sales) 2022 abr. sample[5000] 15206.
##  5 NNETAR(sales) 2022 may. sample[5000] 16823.
##  6 NNETAR(sales) 2022 jun. sample[5000] 18650.
##  7 NNETAR(sales) 2022 jul. sample[5000] 17250.
##  8 NNETAR(sales) 2022 ago. sample[5000] 17494.
##  9 NNETAR(sales) 2022 sep. sample[5000] 16760.
## 10 NNETAR(sales) 2022 oct. sample[5000] 16988.

The best NNET MAPE.

nnet_mape <- fabletools::accuracy(beer_nnet_fc, beer_2022) %>%
  select(MAPE) %>%
  unlist()
nnet_mape
##     MAPE 
## 5.965551

Let’s plot the forecast results.

beer_nnet_fc %>%
  autoplot(beer_tbls, level = NULL) +
  geom_vline(xintercept = as.Date("2022-01-01"), lty = 2) +
  labs(y = "Sales", x = "Date") 
NNAR(15,1,8)[12] forecast.

Figure 2.10: NNAR(15,1,8)[12] forecast.

The plot above is not very clear. Here is a zoom version.

  ggplot(beer_zoom, aes(date, sales), lwd = 2, alpha = 0.4) +
  geom_line() +
  geom_point(size = 5, color = "black", alpha = 0.5, 
             shape = 21, fill = "black") +
  geom_point(aes(y = .mean), size = 5, 
             color = "red", alpha = 0.5, shape = 21, 
             fill = "red", data = beer_nnet_fc) +
  geom_line(aes(y = .mean), color = "red", size = 0.5, data = beer_nnet_fc) +
  geom_vline(xintercept = as.Date("2022-01-01"), lty = 2) +
  labs(x = "Date", y = "Sales",
       caption = c(paste("MAPE=",(round(nnet_mape, 5))))) +
  theme(legend.position = "bottom")
Zoom. NNAR(15,1,8)[12] forecast.

Figure 2.11: Zoom. NNAR(15,1,8)[12] forecast.

Update the MAPE table.

mape_updated <- mape_updated %>%
  add_row(.model = "NNAR(15,1,8)[12]", MAPE = nnet_mape) %>%
  arrange(desc(-MAPE))
mape_updated
## # A tibble: 7 × 2
##   .model                   MAPE
##   <chr>                   <dbl>
## 1 Seasonal naïve           3.76
## 2 ETS(M,A,M)               4.01
## 3 ARIMA(4,1,1)(0,1,1)[12]  4.53
## 4 NNAR(15,1,8)[12]         5.97
## 5 Naïve                   18.6 
## 6 Drift                   21.4 
## 7 Mean                    22.9
# I save the mape_updated object to use it in hh2o.rmd
saveRDS(mape_updated, "mape_updated.rds")

3 Forecasts with H2O.

As in the previous section, the problem is to forecast a time series. In particular, the time series is the Beer, Wine, and Distilled Alcoholic Beverages Sales as in the original Matt Dancho’s example. The data is taken from FRED (Federal Reserve Economic Data). The data belongs to the non-durable goods category, it includes U.S. merchant wholesalers, except manufacturers’ sales branches and offices sales. The monthly time series goes from 2010-01-01 to 2022-10-31. And the goal is to use 2022 data (10 months) as a test data to conduct the forecast.

For the full database details see: https://fred.stlouisfed.org/series/S4248SM144NCEN

Let’s load the R packages.

# Load libraries
library(fpp3)
library(h2o)        # ML Library.
library(timetk)     # Toolkit for working with time series in R.
library(tidyquant)  # Loads tidyverse, financial pkgs, used to get data.
library(dplyr)      # Database manipulation.
library(ggplot2)    # Plots.
library(tibble)     # Tables.
library(kableExtra) # Tables.
library(knitr)      
library(bit64)      # Useful in the machine learning workflow.
library(sweep)      # Broom-style tidiers for the forecast package.
library(forecast)   # Forecasting models and predictions package.
library(seasonal)
library(tictoc)
# This comes from ffpp3.rmd
mape_updated <- readRDS("mape_updated.rds")
mape_updated
## # A tibble: 7 × 2
##   .model                   MAPE
##   <chr>                   <dbl>
## 1 Seasonal naïve           3.76
## 2 ETS(M,A,M)               4.01
## 3 ARIMA(4,1,1)(0,1,1)[12]  4.53
## 4 NNAR(15,1,8)[12]         5.97
## 5 Naïve                   18.6 
## 6 Drift                   21.4 
## 7 Mean                    22.9

We can conveniently download the data directly from the FRED API in one line of code.

# Beer, Wine, Distilled Alcoholic Beverages, in Millions USD.
beer <- tq_get("S4248SM144NCEN", get = "economic.data", 
               from = "2010-01-01", to = "2022-10-31")

Let’s have a look of the data set. By default it says price, but these are sales figures in monetary terms. According to the main FRED reference, these are in millions of dollars, not seasonally adjusted.

head(beer)
## # A tibble: 6 × 3
##   symbol         date       price
##   <chr>          <date>     <int>
## 1 S4248SM144NCEN 2010-01-01  6558
## 2 S4248SM144NCEN 2010-02-01  7481
## 3 S4248SM144NCEN 2010-03-01  9475
## 4 S4248SM144NCEN 2010-04-01  9424
## 5 S4248SM144NCEN 2010-05-01  9351
## 6 S4248SM144NCEN 2010-06-01 10552

We can change the name of the price column.

beer <- beer %>%
  rename(sales = price)

tail(beer)
## # A tibble: 6 × 3
##   symbol         date       sales
##   <chr>          <date>     <int>
## 1 S4248SM144NCEN 2022-05-01 16600
## 2 S4248SM144NCEN 2022-06-01 17700
## 3 S4248SM144NCEN 2022-07-01 15031
## 4 S4248SM144NCEN 2022-08-01 16860
## 5 S4248SM144NCEN 2022-09-01 16305
## 6 S4248SM144NCEN 2022-10-01 15418

Better now.

3.1 The H2O package.

Machine learning is the study of computer algorithms that improve automatically through experience. There are many ways and approaches to implement machine learning especially in time series forecasts purposes. This document heavily relies on h2o library. The h2o package is a product offered by H2O.ai that contains a number of cutting edge machine learning algorithms, performance metrics, and auxiliary functions to make machine learning both powerful and easy to implement.

One of the most important features of this package is the h2o.automl() (Automatic Machine Learning). H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit. Stacked Ensembles – one based on all previously trained models, another one on the best model of each family – will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top performing models in the AutoML Leaderboard. We can verify this in the example below.

This document has limited explanations about the applied machine learning techniques. The value of this document is to gather several examples that are originally presented separately in Business Science IO and R-bloggers. sites and extend the analysis to elaborate further on the code logic and interpretation. It can also be useful to better understand how the R functions work, how results are produced, and it could help to replicate a different example with a new database for those who are new in the field.

You have to download and install H2O. Click here for full instructions. You are also expected to review the H2O webpage contents because they have important information that will allow you to better understand the value of this machine learning tool.

The main objective here is to use h2o locally (in your own computer) to develop a high accuracy time series model on the beer data set. This is a supervised machine learning regression problem. An interesting reference to learn the basics of supervised and unsupervised machine learning techniques applied to business is: Hull (2020).

3.2 The data.

The validation dataset is the sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.

beer %>%
  ggplot(aes(date, sales)) +
  # Train Region:
  annotate("text", x = ymd("2013-01-01"), y = 14000, 
           color = "black", label = "Train region") +
  # Validation Region:
  geom_rect(xmin = as.numeric(ymd("2021-01-01")), 
            xmax = as.numeric(ymd("2021-12-31")), ymin = 0, ymax = 20000, 
            alpha = 0.01, fill = "purple") +
  annotate("text", x = ymd("2021-04-01"), y = 7000,
           color = "black", label = "Validation\nregion") +
  # Test Region:
  geom_rect(xmin = as.numeric(ymd("2022-01-01")), 
            xmax = as.numeric(ymd("2022-09-30")), ymin = 0, ymax = 20000, 
            alpha = 0.02, fill = "pink") +
  annotate("text", x = ymd("2022-06-01"), y = 9000,
           color = "black", label = "Test\nregion") +
  # Data.
  geom_line(col = "black") +
  geom_point(col = "black", alpha = 0.5, size = 2) +
  # Aesthetics.
  theme_tq() +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(subtitle = 
  "Train (2010 - 2020), validation (2021), and test set (Jan 2022 to Oct 2022)",
  x = "Date", y = "Sales",
       caption = "The models do not know the test region, this is for us 
       to see how well the models do the 10-month ahead forecast.") 
Beer sales: train, validation and test sets.

Figure 3.1: Beer sales: train, validation and test sets.

beer %>%
  filter(date > as.Date("2020-01-01")) %>%
  ggplot(aes(date, sales)) +
  # Train Region:
  annotate("text", x = ymd("2020-08-01"), y = 14000, 
           color = "black", label = "Train region") +
  # Validation Region:
  geom_rect(xmin = as.numeric(ymd("2021-01-01")), 
            xmax = as.numeric(ymd("2021-12-31")), ymin = 0, ymax = 20000, 
            alpha = 0.01, fill = "purple") +
  annotate("text", x = ymd("2021-07-01"), y = 14000,
           color = "black", label = "Validation region") +
  # Test Region:
  geom_rect(xmin = as.numeric(ymd("2022-01-01")), 
            xmax = as.numeric(ymd("2022-09-30")), ymin = 0, ymax = 20000, 
            alpha = 0.02, fill = "pink") +
  annotate("text", x = ymd("2022-05-01"), y = 14000,
           color = "black", label = "Test region") +
  # Data.
  geom_line(col = "black") +
  geom_point(col = "black", alpha = 0.5, size = 5) +
  # Aesthetics.
  theme_tq() +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(subtitle = 
  "Train (2010 - 2020), validation (2021), and test set (Jan 2022 to Oct 2022)",
  x = "Date", y = "Sales",
       caption = "The models do not know the test region, this is for us 
       to see how well the models do the 10-month ahead forecast.") 
Zoom. Beer sales: train, validation and test sets.

Figure 3.2: Zoom. Beer sales: train, validation and test sets.

The tk_augment_timeseries_signature() function expands out the timestamp information column-wise into a machine learning feature set, adding columns of time series information to the original data frame. We’ll again use head() function for quick inspection of this expansion. See how there are now 31 features extracted from the original database. Not all will be important for the final and chosen models, but some will.

# See the full list of new variables to realize the expansion effect.
beer_aug <- beer %>%
  tk_augment_timeseries_signature() 

tail(beer_aug)
## # A tibble: 6 × 31
##   symbol       date       sales index…¹   diff  year year.…²  half quarter month
##   <chr>        <date>     <int>   <dbl>  <dbl> <int>   <int> <int>   <int> <int>
## 1 S4248SM144N… 2022-05-01 16600  1.65e9 2.59e6  2022    2022     1       2     5
## 2 S4248SM144N… 2022-06-01 17700  1.65e9 2.68e6  2022    2022     1       2     6
## 3 S4248SM144N… 2022-07-01 15031  1.66e9 2.59e6  2022    2022     2       3     7
## 4 S4248SM144N… 2022-08-01 16860  1.66e9 2.68e6  2022    2022     2       3     8
## 5 S4248SM144N… 2022-09-01 16305  1.66e9 2.68e6  2022    2022     2       3     9
## 6 S4248SM144N… 2022-10-01 15418  1.66e9 2.59e6  2022    2022     2       4    10
## # … with 21 more variables: month.xts <int>, month.lbl <ord>, day <int>,
## #   hour <int>, minute <int>, second <int>, hour12 <int>, am.pm <int>,
## #   wday <int>, wday.xts <int>, wday.lbl <ord>, mday <int>, qday <int>,
## #   yday <int>, mweek <int>, week <int>, week.iso <int>, week2 <int>,
## #   week3 <int>, week4 <int>, mday7 <int>, and abbreviated variable names
## #   ¹​index.num, ²​year.iso

Note how we went from 3 columns in beer to 31 columns in beer_aug.

We need to prepare the data in a format for H2O. First, let’s remove any unnecessary columns such as dates or those with missing values, and change the ordered classes to plain factors. We prefer dplyr operations for these steps. Sometimes we do not need to implement this step as the data is already clean (as in this case), but sometimes it is not. Thus, let’s clean the data.

# See the full list of variables to realize the cleaning effect.
beer_clean <- beer_aug %>%
  select_if(~ !is.Date(.)) %>%
  select_if(~ !any(is.na(.))) %>%
  mutate_if(is.ordered, ~ as.character(.) %>% as.factor)

head(beer_clean)
## # A tibble: 6 × 29
##   symbol   sales index…¹  year year.…²  half quarter month month…³ month…⁴   day
##   <chr>    <int>   <dbl> <int>   <int> <int>   <int> <int>   <int> <fct>   <int>
## 1 S4248SM…  6558  1.26e9  2010    2009     1       1     1       0 enero       1
## 2 S4248SM…  7481  1.26e9  2010    2010     1       1     2       1 febrero     1
## 3 S4248SM…  9475  1.27e9  2010    2010     1       1     3       2 marzo       1
## 4 S4248SM…  9424  1.27e9  2010    2010     1       2     4       3 abril       1
## 5 S4248SM…  9351  1.27e9  2010    2010     1       2     5       4 mayo        1
## 6 S4248SM… 10552  1.28e9  2010    2010     1       2     6       5 junio       1
## # … with 18 more variables: hour <int>, minute <int>, second <int>,
## #   hour12 <int>, am.pm <int>, wday <int>, wday.xts <int>, wday.lbl <fct>,
## #   mday <int>, qday <int>, yday <int>, mweek <int>, week <int>,
## #   week.iso <int>, week2 <int>, week3 <int>, week4 <int>, mday7 <int>, and
## #   abbreviated variable names ¹​index.num, ²​year.iso, ³​month.xts, ⁴​month.lbl

The database did not change too much. Now we have 29 columns in beer_clean. In the case of two variables, the structure ordered factors <ord> changed into factors <fct>, which is necessary for some H2O functions.

Let’s split the database into a training, validation and test sets following the time ranges in the visualization above. These training sets are the way most machine learning algorithms can be implemented and evaluated. We normally take more observations for the training, and less observations for the validation and test. The test set (the most recent dates) is unknown in the learning process of the models, the test set will be useful for us to be able to compare forecasts versus what really happened. This is how we can measure out-of-sample estimation errors.

# Split into training, validation and test sets.
train_tbl <- beer_clean %>% filter(year < 2021)
valid_tbl <- beer_clean %>% filter(year == 2021)
test_tbl  <- beer_clean %>% filter(year == 2022)

test_tbl$sales
##  [1] 11926 13333 16165 15584 16600 17700 15031 16860 16305 15418

Remember our goal is to forecast the first 10 months of 2022.

3.3 Prepare for H2O.

First, fire up H2O. This will initialize the Java Virtual Machine (JVM) that H2O uses locally. In simple terms, here your local computer will remotely connect to a high-power clusters to do the H2O machine learning job. This is not only amazing, it is also free.

h2o.init() # Fire up h2o.
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 hours 57 minutes 
##     H2O cluster timezone:       America/Mexico_City 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.38.0.1 
##     H2O cluster version age:    2 months and 22 days  
##     H2O cluster name:           H2O_started_from_R_ML_usi385 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.29 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.2.1 (2022-06-23 ucrt)

We need the data sets in a format that can be readable by H2O. This is an easy step.

# Convert to H2OFrame objects.
h2o.no_progress() # We do not need a progress bar here.
train_h2o <- as.h2o(train_tbl)
valid_h2o <- as.h2o(valid_tbl)
test_h2o  <- as.h2o(test_tbl)

Let’s list the names of the variables.

# Set names for h2o.
y <- "sales"
x <- setdiff(names(train_h2o), y) # Adds sales to the names list.
kable(matrix(x, 7, 4), caption = "Summary of variable names.") %>%
kable_styling(latex_options = "HOLD_position")
Table 3.1: Summary of variable names.
symbol month.xts am.pm mweek
index.num month.lbl wday week
year day wday.xts week.iso
year.iso hour wday.lbl week2
half minute mday week3
quarter second qday week4
month hour12 yday mday7

The h2o.automl() is a function in H2O that automates the process of building a large number of models, with the goal of finding the best model without any prior knowledge or effort by the data scientist. The alternative of using h2o.automl() is to pick some models according to the database characteristics, implement the models, and pick the one with the best performance according to some evaluation criterion. This alternative is time consuming and it could use an intensive computational memory and power, this is why H2O is valuable. If H2O was already amazing, this function makes it even more powerful.

The available algorithms that h2o.automl() currently run and compare are (click on each one to see a full description):

It is a good time to define how we are going to use some concepts at least in this document. Here, we call forecasting techniques to the three techniques implemented in this document: machine learning using H2O, linear regression, and ARIMA. When we implement h2o.automl() function, H2O test for the six algorithms listed above. Each algorithm includes many other models that belongs to these algorithms in the machine learning process. The result of h2o.automl() is one model that belongs to one algorithm. This is the difference between forecasting techniques, algorithms, and models.

3.4 Estimate H2O models.

Here, we implement the h2o.automl() in three different ways because of reproducibility issues. Reproducibility means obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis. It turns out that Deep Learning cannot be reproducible by construction. Then, we first apply h2o.automl() without Deep Learning. Second, we apply h2o.automl() with only Deep Learning (here the results will be different each time we run the code). And third, including all available algorithms in h2o.automl() (again, the results might change every time we run the code). The first is the only one which can be reproducible and the other two are expected to change every time we run the R code.

Please note that in the code below we set exclude_algos to exclude Deep Learning, and seed = 13 to make sure every time we run the code we can get the same results.

# It takes some time to run. 
tic("inner1")

automl_models_h2o <- h2o.automl(x = x, y = y, training_frame = train_h2o, 
    validation_frame = valid_h2o, leaderboard_frame = test_h2o,
    exclude_algos = c("DeepLearning"), # without Deep Learning.
    #max_models = 10, # We can adjust this to save time.
    max_runtime_secs = 60, stopping_metric = "RMSE", seed = 13,
    verbosity = NULL)

t_inner1 <- toc()
## inner1: 508.55 sec elapsed

After 8.48 minutes running, the selected model by h2o.automl() is:

# Extract leader model.
automl_leader <- automl_models_h2o@leader
automl_leader@algorithm
## [1] "stackedensemble"

See why stackedensemble was the chosen one:

# Show the first 10.
kable(head(automl_models_h2o@leaderboard[,1:2], 10),
caption = "Model rankings: h2o.automl without Deep Learning algorithm.", 
digits = 2, row.names = TRUE) %>%
kable_styling(font_size = 16, latex_options = "HOLD_position")
Table 3.2: Model rankings: h2o.automl without Deep Learning algorithm.
model_id rmse
1 StackedEnsemble_BestOfFamily_5_AutoML_7_20221212_111148 952.22
2 StackedEnsemble_BestOfFamily_3_AutoML_7_20221212_111148 958.86
3 GBM_grid_1_AutoML_7_20221212_111148_model_39 1084.30
4 GBM_grid_1_AutoML_7_20221212_111148_model_64 1133.66
5 GBM_grid_1_AutoML_7_20221212_111148_model_118 1142.00
6 GBM_grid_1_AutoML_7_20221212_111148_model_123 1149.87
7 GBM_grid_1_AutoML_7_20221212_111148_model_115 1166.56
8 StackedEnsemble_BestOfFamily_4_AutoML_7_20221212_111148 1227.81
9 GBM_grid_1_AutoML_7_20221212_111148_model_17 1288.45
10 GBM_grid_1_AutoML_7_20221212_111148_model_16 1300.10

The model_id column list the top 10 models with the lowest errors. The value of h2o.automl() is that we can take the best model and use it to conduct our forecast. Remember we proposed to run h2o.automl() three times. Now let’s consider the second alternative (only Deep Learning). There are several ways to implement Deep Learning, this is why it makes sense to use only this family into the h2o.automl() function. Deep Learning cannot be reproducible by construction so adding a seed in this case would be useless.

# This might take some time to run. 
tic("inner2")

DL <- h2o.automl(x = x, y = y, training_frame = train_h2o, 
    validation_frame = valid_h2o, leaderboard_frame = test_h2o,
    include_algos = c("DeepLearning"), max_runtime_secs = 60, 
    stopping_metric = "RMSE", verbosity = NULL)

t_inner2 <- toc()
## inner2: 183.91 sec elapsed

After 3.07 minutes running, the selected model by h2o.automl() is:

# Extract leader model
automl_DL <- DL@leader
automl_DL@algorithm
## [1] "deeplearning"

See why this specific deeplearning model was the chosen one:

kable(head(DL@leaderboard[,1:2], 10), 
  caption = "Model rankings: h2o.automl with only Deep Learning algorithm.",
  digits = 2, row.names = TRUE) %>%
  kable_styling(font_size = 16, latex_options = "HOLD_position")
Table 3.3: Model rankings: h2o.automl with only Deep Learning algorithm.
model_id rmse
1 DeepLearning_grid_1_AutoML_8_20221212_112018_model_2 1067.65
2 DeepLearning_grid_1_AutoML_8_20221212_112018_model_1 1200.23
3 DeepLearning_grid_1_AutoML_8_20221212_112018_model_3 1299.89
4 DeepLearning_grid_1_AutoML_8_20221212_112018_model_7 1342.15
5 DeepLearning_1_AutoML_8_20221212_112018 1422.94
6 DeepLearning_grid_1_AutoML_8_20221212_112018_model_6 1729.64
7 DeepLearning_grid_1_AutoML_8_20221212_112018_model_5 2184.52
8 DeepLearning_grid_1_AutoML_8_20221212_112018_model_4 5195.68

All models belong to the same algorithm, but we clearly choose the first one of the list. The machine learning workflow estimate a number of models using the train region and evaluate them using the validation region. The estimated model parameters then change as they learn from their mistakes. This process is repeated until a specific restriction meets, in this case max_runtime_secs is set to 60 seconds. At the end, we select the best ranked model.

Now let’s consider the third alternative. This is, run h2o.automl() with no restrictions at all. Here, it would be interesting to see if this led to the best alternative. In principle, we cannot anticipate which one of these three runs will be the best. This is because the Deep Learning algorithm has a random component which might lead to better results, and remember the second round was exclusive for Deep Learning and the third includes Deep Learning. Then, every time I compile this document or run this R code we should expect different results in the second and third alternative.

tic("inner3")

# This might take some time to run. 
automl_models_h2o_all <- h2o.automl(x = x, y = y, 
    training_frame = train_h2o, validation_frame = valid_h2o, 
    leaderboard_frame = test_h2o, max_runtime_secs = 60, 
    stopping_metric = "RMSE", verbosity = NULL)

t_inner3 <- toc()
## inner3: 479.92 sec elapsed

After 8 minutes running, the selected model by h2o.automl() is:

# Extract leader model
automl_leader_all <- automl_models_h2o_all@leader
automl_leader_all@algorithm
## [1] "deeplearning"

See why deeplearning model was the chosen one in this specific and unique code compilation:

# Let's show the first 10 of the list.
kable(head(automl_models_h2o_all@leaderboard[,1:2], 10), 
    caption = "Model rankings: h2o.automl with all available algorithms.",
    digits = 2, row.names = TRUE) %>%
  kable_styling(font_size = 16, latex_options = "HOLD_position")
Table 3.4: Model rankings: h2o.automl with all available algorithms.
model_id rmse
1 DeepLearning_grid_1_AutoML_9_20221212_112324_model_1 825.27
2 StackedEnsemble_BestOfFamily_3_AutoML_9_20221212_112324 961.02
3 GBM_grid_1_AutoML_9_20221212_112324_model_47 1175.62
4 StackedEnsemble_AllModels_4_AutoML_9_20221212_112324 1256.80
5 GBM_grid_1_AutoML_9_20221212_112324_model_16 1258.37
6 StackedEnsemble_AllModels_3_AutoML_9_20221212_112324 1290.46
7 GBM_grid_1_AutoML_9_20221212_112324_model_67 1349.82
8 GBM_grid_1_AutoML_9_20221212_112324_model_11 1388.04
9 GBM_grid_1_AutoML_9_20221212_112324_model_2 1395.35
10 GBM_grid_1_AutoML_9_20221212_112324_model_23 1401.64

Let’s summarize the results according to the mean residual deviance as this was the criterion in stopping_metric. The table shows the best ranked model according to our three different runs of h2o.automl().

# Collect model names and the rmse.
without_DL <- c(automl_leader@algorithm, 
                round(automl_models_h2o@leaderboard[1, 2], 2))
only_DL <- c(automl_DL@algorithm, 
             round(DL@leaderboard[1,2], 2))
all <- c(automl_leader_all@algorithm, 
         round(automl_models_h2o_all@leaderboard[1, 2], 2))
# Three different runs of h2o.automl.
automl_three <- data.frame(without_DL, only_DL, all)
colnames(automl_three) <- c("Without Deep Learning", "Only Deep Learning",
                            "All algorithms")
kable(automl_three, 
caption = "Top ranked models: h2o.automl rmse.") %>%
kable_styling(latex_options = "HOLD_position")
Table 3.5: Top ranked models: h2o.automl rmse.
Without Deep Learning Only Deep Learning All algorithms
stackedensemble deeplearning deeplearning
952.22 1067.65 825.27

This is interesting because this suggest that it makes sense to run the H2O more than one time. It would be good to test for a different stopping_metric, max_runtime_secs and max_models.

3.5 Predictions.

Here are how the forecasts are calculated.

# The h2o.predict function do the job.
pred_h2o <- h2o.predict(automl_leader, newdata = test_h2o)
pred_h2o_DL <- h2o.predict(automl_DL, newdata = test_h2o)
pred_h2o_all <- h2o.predict(automl_leader_all, newdata = test_h2o)

Let’s show the results in a table. First, the case without Deep Learning.

# 10-period forecast error: h2o.automl without Deep Learning.
error_tbl <- beer %>% 
    filter(lubridate::year(date) == 2022) %>%
    add_column(pred = pred_h2o %>% as_tibble() %>% pull(predict)) %>%
    rename(actual = sales) %>%
    mutate(error = actual - pred, error_pct = error / actual)
kable(error_tbl, 
caption = "Detailed performance: h2o.automl without Deep Learning algorithm.",
digits = 3, row.names = TRUE) %>%
kable_styling(latex_options = "HOLD_position")
Table 3.6: Detailed performance: h2o.automl without Deep Learning algorithm.
symbol date actual pred error error_pct
1 S4248SM144NCEN 2022-01-01 11926 11685.06 240.942 0.020
2 S4248SM144NCEN 2022-02-01 13333 12623.77 709.228 0.053
3 S4248SM144NCEN 2022-03-01 16165 14892.46 1272.543 0.079
4 S4248SM144NCEN 2022-04-01 15584 14042.03 1541.969 0.099
5 S4248SM144NCEN 2022-05-01 16600 15870.11 729.888 0.044
6 S4248SM144NCEN 2022-06-01 17700 16355.37 1344.633 0.076
7 S4248SM144NCEN 2022-07-01 15031 14747.94 283.056 0.019
8 S4248SM144NCEN 2022-08-01 16860 15898.36 961.635 0.057
9 S4248SM144NCEN 2022-09-01 16305 15383.26 921.735 0.057
10 S4248SM144NCEN 2022-10-01 15418 15978.25 -560.247 -0.036

The forecast looks good. Note that in some cases it over-estimate and in others under-estimate the real values, but in general these differences are small. Now, let’s look at the same information in a plot.

beer %>%
  filter(date > as.Date("2021-01-01")) %>%
    ggplot(aes(x = date, y = sales)) +
    # Data.
    geom_point(size = 5, color = "black", alpha = 0.5, 
               shape = 21, fill = "black") +
    geom_line(color = "black", size = 0.5) +
    # Predictions.
    geom_point(aes(y = pred), size = 5, 
               color = "black", alpha = 0.5, shape = 21, 
               fill = "red", data = error_tbl) +
geom_line(aes(y = pred), color = "red", size = 0.5, data = error_tbl) +
geom_vline(xintercept = as.numeric(as.Date("2021-12-01")), linetype = 2) +
    # Aesthetics.
    labs(x = "Date", y = "Sales",
    caption = c(paste("MAPE=",((mean(abs(error_tbl$error_pct))*100)))))
Forecast: H2O without Deep Learning algorithm.

Figure 3.3: Forecast: H2O without Deep Learning algorithm.

This is an additional performance summary.

# Without Deep Learning.
h2o.performance(automl_leader, newdata = test_h2o)
## H2ORegressionMetrics: stackedensemble
## 
## MSE:  906720
## RMSE:  952.2185
## MAE:  856.5875
## RMSLE:  0.06141063
## Mean Residual Deviance :  906720

Now, the case of only Deep Learning. The detailed forecast is in the following table.

# 10-period forecast error: h2o.automl only Deep Learning.
error_tbl_DL <- beer %>% 
    filter(lubridate::year(date) == 2022) %>%
    add_column(pred = pred_h2o_DL %>% as_tibble() %>% pull(predict)) %>%
    rename(actual = sales) %>%
    mutate(error = actual - pred, error_pct = error / actual) 
kable(error_tbl_DL, 
caption = "Detailed performance: h2o.automl only Deep Learning algoritm.",
      digits = 3, row.names = TRUE) %>%
kable_styling(latex_options = "HOLD_position")
Table 3.7: Detailed performance: h2o.automl only Deep Learning algoritm.
symbol date actual pred error error_pct
1 S4248SM144NCEN 2022-01-01 11926 11322.76 603.238 0.051
2 S4248SM144NCEN 2022-02-01 13333 12980.61 352.389 0.026
3 S4248SM144NCEN 2022-03-01 16165 14836.85 1328.146 0.082
4 S4248SM144NCEN 2022-04-01 15584 13080.55 2503.447 0.161
5 S4248SM144NCEN 2022-05-01 16600 15797.52 802.482 0.048
6 S4248SM144NCEN 2022-06-01 17700 16932.62 767.377 0.043
7 S4248SM144NCEN 2022-07-01 15031 14228.38 802.625 0.053
8 S4248SM144NCEN 2022-08-01 16860 16210.62 649.376 0.039
9 S4248SM144NCEN 2022-09-01 16305 15983.39 321.611 0.020
10 S4248SM144NCEN 2022-10-01 15418 16108.88 -690.878 -0.045

The same information in a plot.

beer %>%
    filter(date > as.Date("2021-01-01")) %>%
    ggplot(aes(x = date, y = sales)) +
    # Data.
    geom_point(size = 5, color = "black", alpha = 0.5, 
               shape = 21, fill = "black") +
    geom_line(color = "black", size = 0.5) +
    # Predictions.
    geom_point(aes(y = pred), size = 5, 
               color = "black", alpha = 0.5, shape = 21, 
               fill = "red", data = error_tbl_DL) +
geom_line(aes(y = pred), color = "red", size = 0.5, 
          data = error_tbl_DL) +
    geom_vline(xintercept = as.numeric(as.Date("2021-12-01")), 
               linetype = 2) +
    # Aesthetics.
    labs(x = "Date", y = "Sales",
    caption = c(paste("MAPE=",((mean(abs(error_tbl_DL$error_pct))*100)))))
Forecast: H2O including only Deep Learning algorithm.

Figure 3.4: Forecast: H2O including only Deep Learning algorithm.

Additional performance indicators.

# Only Deep Learning.
h2o.performance(automl_DL, newdata = test_h2o)
## H2ORegressionMetrics: deeplearning
## 
## MSE:  1139878
## RMSE:  1067.651
## MAE:  882.1569
## RMSLE:  0.07260393
## Mean Residual Deviance :  1139878

This is the H2O case with no restrictions, considering all available algorithms.

# 10-period forecast error: h2o.automl all algorithms.
error_tbl_all <- beer %>% 
    filter(lubridate::year(date) == 2022) %>%
    add_column(pred = pred_h2o_all %>% as_tibble() %>% pull(predict)) %>%
    rename(actual = sales) %>%
    mutate(error = actual - pred, error_pct = error / actual) 
kable(error_tbl_all, 
      caption = "Detailed performance: h2o.automl all algorithms.",
      digits = 3, row.names = TRUE) %>%
kable_styling(latex_options = "HOLD_position")
Table 3.8: Detailed performance: h2o.automl all algorithms.
symbol date actual pred error error_pct
1 S4248SM144NCEN 2022-01-01 11926 12139.63 -213.629 -0.018
2 S4248SM144NCEN 2022-02-01 13333 13544.17 -211.167 -0.016
3 S4248SM144NCEN 2022-03-01 16165 15073.50 1091.502 0.068
4 S4248SM144NCEN 2022-04-01 15584 13761.56 1822.440 0.117
5 S4248SM144NCEN 2022-05-01 16600 15704.23 895.767 0.054
6 S4248SM144NCEN 2022-06-01 17700 17116.70 583.301 0.033
7 S4248SM144NCEN 2022-07-01 15031 14367.99 663.006 0.044
8 S4248SM144NCEN 2022-08-01 16860 16221.53 638.467 0.038
9 S4248SM144NCEN 2022-09-01 16305 15960.68 344.321 0.021
10 S4248SM144NCEN 2022-10-01 15418 15733.39 -315.388 -0.020

The visual representation.

beer %>%
    filter(date > as.Date("2021-01-01")) %>%
    ggplot(aes(x = date, y = sales)) +
    # Data.
    geom_point(size = 5, color = "black", alpha = 0.5, 
               shape = 21, fill = "black") +
    geom_line(color = "black", size = 0.5) +
    # Predictions.
    geom_point(aes(y = pred), size = 5, 
               color = "black", alpha = 0.5, shape = 21, 
               fill = "red", data = error_tbl_all) +
geom_line(aes(y = pred), color = "red", size = 0.5, 
          data = error_tbl_all) +
geom_vline(xintercept = as.numeric(as.Date("2021-12-01")), linetype = 2) +
    # Aesthetics.
    labs(x = "Date", y = "Sales",
    caption = c(paste("MAPE=",((mean(abs(error_tbl_all$error_pct))*100)))))
Forecast: H2O including all available algorithms.

Figure 3.5: Forecast: H2O including all available algorithms.

Additional performance metrics.

h2o.performance(automl_leader_all, newdata = test_h2o)
## H2ORegressionMetrics: deeplearning
## 
## MSE:  681077.3
## RMSE:  825.2741
## MAE:  677.8987
## RMSLE:  0.05431715
## Mean Residual Deviance :  681077.3

These plots show the power of modern forecasting techniques. In finance we care about the future and these techniques can be used as a tool to reduce the uncertainty about the future. Obviously, we cannot predict without errors, but the objective is to achieve the lowest forecasting errors possible.

3.6 Summary H2O models.

It is useful to see the performance results for the three different H2O runs above. First, the performance for the overall 10-period forecast.

# There might be a more compact way to create this table.
error_tbl_summ <- error_tbl %>%
    summarise(model = automl_leader@algorithm,
      me = mean(error), rmse = mean(error^2)^0.5,
              mae  = mean(abs(error)), mape = 100 * mean(abs(error_pct)),
              mpe  = 100 * mean(error_pct)) 
error_tbl_DL_summ <- error_tbl_DL %>%
    summarise(model = automl_DL@algorithm,
      me = mean(error), rmse = mean(error^2)^0.5,
              mae = mean(abs(error)), mape = 100 * mean(abs(error_pct)),
              mpe = 100 * mean(error_pct))
error_tbl_all_summ <- error_tbl_all %>%
    summarise(model = automl_leader_all@algorithm,
      me = mean(error), rmse = mean(error^2)^0.5,
              mae  = mean(abs(error)), mape = 100 * mean(abs(error_pct)),
              mpe  = 100 * mean(error_pct))
error_automl_summ <- rbind(error_tbl_summ, error_tbl_DL_summ,
                                error_tbl_all_summ) %>%
  as.data.frame()

row.names(error_automl_summ) <- c("Without Deep Learning", 
                                  "Only Deep Learning", "All algorithms")

kable(error_automl_summ, 
caption = "Top ranked models: h2o.automl summary forecasting errors.",
digits = 2) %>%
kable_styling(latex_options = "HOLD_position")
Table 3.9: Top ranked models: h2o.automl summary forecasting errors.
model me rmse mae mape mpe
Without Deep Learning stackedensemble 744.54 952.22 856.59 5.40 4.67
Only Deep Learning deeplearning 743.98 1067.65 882.16 5.68 4.78
All algorithms deeplearning 529.86 825.27 677.90 4.29 3.20

As you can see, there are several ways in which we can measure the forecast errors. We can specify which one is the evaluation criterion to rank the models. And we can also determine which error measure: me (mean error), rmse (root mean squared error), mae (mean absolute error), mape (mean absolute percentage error), or mpe (mean percentage error) will be the one to choose between these three alternatives. In my experience, the rmse and the mape are the most popular ones, but the others might be useful in specific circumstances.

We can also show the best point forecast for the three h2o.automl() runs.

point_forecast_1 <- data.frame(
  model = automl_leader@algorithm,
  error_tbl[which.min(abs(error_tbl$error_pct)), 2], 
  error = error_tbl[which.min(abs(error_tbl$error_pct)), 6])
point_forecast_2 <- data.frame(
  model = automl_DL@algorithm,
  error_tbl_DL[which.min(abs(error_tbl_DL$error_pct)), 2],
  error = error_tbl_DL[which.min(abs(error_tbl_DL$error_pct)), 6])
point_forecast_3 <- data.frame(
  model = automl_leader_all@algorithm,
  error_tbl_all[which.min(abs(error_tbl_all$error_pct)), 2],
  error = error_tbl_all[which.min(abs(error_tbl_all$error_pct)), 6])
point_forecast <- rbind.data.frame(point_forecast_1, point_forecast_2,
                                   point_forecast_3)
row.names(point_forecast) <- c("Without Deep Learning", 
                                  "Only Deep Learning", "All algorithms")
kable(point_forecast, 
caption = "Top ranked models: Lowest point forecast percentage errors.",
digits = 6) %>%
kable_styling(latex_options = "HOLD_position")
Table 3.10: Top ranked models: Lowest point forecast percentage errors.
model date error_pct
Without Deep Learning stackedensemble 2022-07-01 0.018831
Only Deep Learning deeplearning 2022-09-01 0.019725
All algorithms deeplearning 2022-02-01 -0.015838

We normally do not choose a model according to one specific point forecast. However, it is interesting to see which alternative and which specific date has been forecasted with the highest accuracy.

4 Two more forecast models.

4.1 Linear regression.

Let’s implement a simple but powerful approach using the lm() function.

4.2 Implement the model.

This is the simplest choice, and still has a very high \(R^2\). The independent variables are all beer_aug variables except for date, diff, and symbol.

# linear regression model used, but can use any model
fit_lm <- lm(sales ~ ., data = 
               select(beer_aug, -c(date, diff, symbol)))
summary(fit_lm)
## 
## Call:
## lm(formula = sales ~ ., data = select(beer_aug, -c(date, diff, 
##     symbol)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1241.86  -348.16   -31.98   361.84  1321.38 
## 
## Coefficients: (16 not defined because of singularities)
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.722e+08  1.789e+08   1.521  0.13068    
## index.num     4.394e-03  2.877e-03   1.527  0.12922    
## year         -1.125e+05  9.321e+04  -1.207  0.22965    
## year.iso     -2.563e+04  8.495e+03  -3.017  0.00309 ** 
## half         -2.579e+03  1.005e+03  -2.567  0.01143 *  
## quarter      -2.720e+04  3.651e+04  -0.745  0.45760    
## month         4.234e+03  1.217e+04   0.348  0.72856    
## month.xts            NA         NA      NA       NA    
## month.lbl.L          NA         NA      NA       NA    
## month.lbl.Q  -1.885e+03  3.086e+02  -6.107 1.15e-08 ***
## month.lbl.C   6.623e+02  8.590e+02   0.771  0.44216    
## month.lbl^4   9.403e+02  2.242e+02   4.193 5.12e-05 ***
## month.lbl^5   5.174e+02  7.028e+02   0.736  0.46290    
## month.lbl^6  -1.354e+02  2.690e+02  -0.503  0.61556    
## month.lbl^7  -1.052e+02  3.154e+02  -0.334  0.73928    
## month.lbl^8   5.170e+02  5.448e+02   0.949  0.34448    
## month.lbl^9          NA         NA      NA       NA    
## month.lbl^10  7.177e+02  3.734e+02   1.922  0.05684 .  
## month.lbl^11         NA         NA      NA       NA    
## day                  NA         NA      NA       NA    
## hour                 NA         NA      NA       NA    
## minute               NA         NA      NA       NA    
## second               NA         NA      NA       NA    
## hour12               NA         NA      NA       NA    
## am.pm                NA         NA      NA       NA    
## wday         -9.776e+01  3.399e+01  -2.877  0.00472 ** 
## wday.xts             NA         NA      NA       NA    
## wday.lbl.L           NA         NA      NA       NA    
## wday.lbl.Q   -1.088e+03  1.657e+02  -6.566 1.20e-09 ***
## wday.lbl.C    3.899e+02  1.437e+02   2.714  0.00758 ** 
## wday.lbl^4   -2.318e+01  1.698e+02  -0.136  0.89165    
## wday.lbl^5    1.466e+02  1.490e+02   0.984  0.32688    
## wday.lbl^6    9.847e+01  1.345e+02   0.732  0.46542    
## mday                 NA         NA      NA       NA    
## qday         -2.999e+02  4.032e+02  -0.744  0.45829    
## yday         -8.948e+01  1.919e+02  -0.466  0.64185    
## mweek         1.286e+01  2.313e+02   0.056  0.95574    
## week         -2.781e+02  3.080e+02  -0.903  0.36822    
## week.iso     -4.789e+02  1.637e+02  -2.926  0.00407 ** 
## week2         4.874e+02  2.642e+02   1.845  0.06737 .  
## week3                NA         NA      NA       NA    
## week4                NA         NA      NA       NA    
## mday7                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 564.5 on 127 degrees of freedom
## Multiple R-squared:  0.9566, Adjusted R-squared:  0.9477 
## F-statistic: 107.6 on 26 and 127 DF,  p-value: < 2.2e-16

At first sight, the model looks promising.

4.3 Predict.

Prediction is easy in R.

# Make predictions
pred <- predict(fit_lm, newdata = test_tbl)
future_idx <- tail(beer$date, 10) # The 10-months forecast period.
predictions_tbl <- tibble(date  = future_idx, value = pred)
predictions_tbl
## # A tibble: 10 × 2
##    date        value
##    <date>      <dbl>
##  1 2022-01-01 12261.
##  2 2022-02-01 13693.
##  3 2022-03-01 15528.
##  4 2022-04-01 14711.
##  5 2022-05-01 15766.
##  6 2022-06-01 17173.
##  7 2022-07-01 14860.
##  8 2022-08-01 16223.
##  9 2022-09-01 15647.
## 10 2022-10-01 15363.

We can investigate the error on our test set (actuals vs predictions).

# Investigate test error
actuals_tbl <- tail(beer[-1], 10)
error_tbl_lm <- left_join(actuals_tbl, predictions_tbl) %>%
    rename(actual = sales, pred = value) %>%
    mutate(error = actual - pred, error_pct = error / actual) 
error_tbl_lm
## # A tibble: 10 × 5
##    date       actual   pred  error error_pct
##    <date>      <int>  <dbl>  <dbl>     <dbl>
##  1 2022-01-01  11926 12261. -335.   -0.0281 
##  2 2022-02-01  13333 13693. -360.   -0.0270 
##  3 2022-03-01  16165 15528.  637.    0.0394 
##  4 2022-04-01  15584 14711.  873.    0.0560 
##  5 2022-05-01  16600 15766.  834.    0.0502 
##  6 2022-06-01  17700 17173.  527.    0.0298 
##  7 2022-07-01  15031 14860.  171.    0.0114 
##  8 2022-08-01  16860 16223.  637.    0.0378 
##  9 2022-09-01  16305 15647.  658.    0.0403 
## 10 2022-10-01  15418 15363.   54.9   0.00356

And we can calculate a few residuals metrics. A more complex algorithm could produce more accurate results.

# Calculating test error metrics
test_residuals_lm <- error_tbl_lm$error
test_error_pct_lm <- error_tbl_lm$error_pct * 100 # Percentage error.
me   <- mean(test_residuals_lm, na.rm = TRUE)
rmse <- mean(test_residuals_lm^2, na.rm = TRUE)^0.5
mae  <- mean(abs(test_residuals_lm), na.rm = TRUE)
mape <- mean(abs(test_error_pct_lm), na.rm = TRUE)
mpe  <- mean(test_error_pct_lm, na.rm = TRUE)
tibble(me, rmse, mae, mape, mpe) %>% 
  glimpse()
## Rows: 1
## Columns: 5
## $ me   <dbl> 369.6522
## $ rmse <dbl> 570.4356
## $ mae  <dbl> 508.6483
## $ mape <dbl> 3.235825
## $ mpe  <dbl> 2.133975

Visualize our forecast.

beer %>%
    filter(date > as.Date("2021-01-01")) %>%
    ggplot(aes(x = date, y = sales)) +
    # Training data.
    geom_line(color = "black", size = 0.5) +
    geom_point(color = "black", size = 5, alpha = 0.5) +
    # Predictions.
    geom_line(aes(y = value), size = 0.5,
              color = "red", data = predictions_tbl) +
    geom_point(aes(y = value), size = 5, alpha = 0.5,
               color = "red", data = predictions_tbl) +
    # Actuals
 #   geom_line(color = "black", size = 0.5, data = actuals_tbl) +
 #   geom_point(color = "black", size = 5, alpha = 0.5, data = actuals_tbl) +
geom_vline(xintercept = as.numeric(as.Date("2021-12-01")), linetype = 2) +
    # Aesthetics
    theme_tq() +
    labs(x = "Date", y = "Sales",
        caption = c(paste("MAPE=",((mean(abs(test_error_pct_lm)))))))
Forecast: multivariate linear regression.

Figure 4.1: Forecast: multivariate linear regression.

This is clearly a good alternative. The H2O machine learning could lead to better results if we consider a longer time-series because in that way the possibilities to learn increases.

4.4 ARIMA.

Here, sweep is used for tidying the forecast package workflow. We’ll work through an ARIMA analysis to forecast the next 10 months of time series data. In this way we can compare our previous results.

4.5 Prepare the data.

The tk_ts coerce time series objects and tibbles with date/date-time columns to ts (time-series).

# Convert from tbl to ts.
beer_sales_ts <- tk_ts(beer[1:144,], start = 2010, freq = 12)
beer_sales_ts
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 2010  6558  7481  9475  9424  9351 10552  9077  9273  9420  9413  9866 11455
## 2011  6901  8014  9832  9281  9967 11344  9106 10469 10085  9612 10328 11483
## 2012  7486  8641  9709  9423 11342 11274  9845 11163  9532 10754 10953 11922
## 2013  8383  8870 10085 10462 12177 11342 11139 11409 10442 11480 11077 12635
## 2014  8506  9004  9991 10904 11709 11815 10875 10884 10724 11697 10352 13153
## 2015  8278  8925 10556 10932 11329 12709 11700 11079 11882 11866 11420 14103
## 2016  8557 10201 11951 11255 12048 13456 10755 12465 12037 11671 12757 14133
## 2017  8863 10242 12231 11257 13266 14420 11162 13098 11624 12396 12920 13883
## 2018  9277 10093 12270 11526 13716 14140 12245 13792 12023 13529 13786 15174
## 2019 10447 10833 12301 12944 14430 14145 13355 14058 12912 14389 13652 16158
## 2020 10580 11267 13381 12232 13997 16112 15670 15281 15506 16188 14946 16335
## 2021 11358 12369 15333 15585 15455 17776 15692 16104 15824 15613 16896 18139

Just verify tk_ts worked.

# Check that ts-object has a timetk index.
has_timetk_idx(beer_sales_ts)
## [1] TRUE

Great. This will be important when we use sw_sweep later. Next, we’ll model using ARIMA.

4.6 Implement the model.

We can use the auto.arima function from the forecast package to model the time series. By doing that, we do not have to impose a specific ARIMA model, the function can test the best specification for us.

# Model using auto.arima.
set.seed(13)
fit_arima <- auto.arima(beer_sales_ts)
fit_arima
## Series: beer_sales_ts 
## ARIMA(1,1,2)(2,1,1)[12] 
## 
## Coefficients:
##           ar1     ma1      ma2    sar1     sar2     sma1
##       -0.7346  0.0933  -0.7317  0.3952  -0.3375  -0.8322
## s.e.   0.0956  0.0859   0.0708  0.1125   0.1064   0.1352
## 
## sigma^2 = 295768:  log likelihood = -1016.52
## AIC=2047.05   AICc=2047.96   BIC=2067.18

The sw_tidy() function returns the model coefficients in a tibble (tidy data frame). This might be useful in some circumstances.

# sw_tidy - Get model coefficients.
sw_tidy(fit_arima)
## # A tibble: 6 × 2
##   term  estimate
##   <chr>    <dbl>
## 1 ar1    -0.735 
## 2 ma1     0.0933
## 3 ma2    -0.732 
## 4 sar1    0.395 
## 5 sar2   -0.338 
## 6 sma1   -0.832

The sw_glance() function returns the training set accuracy measures in a tibble (tidy data frame). We use glimpse to aid in quickly reviewing the model metrics.

# sw_glance - Get model description and training set accuracy measures.
sw_glance(fit_arima) %>%
    glimpse()
## Rows: 1
## Columns: 12
## $ model.desc <chr> "ARIMA(1,1,2)(2,1,1)[12]"
## $ sigma      <dbl> 543.8454
## $ logLik     <dbl> -1016.525
## $ AIC        <dbl> 2047.05
## $ BIC        <dbl> 2067.176
## $ ME         <dbl> 44.93784
## $ RMSE       <dbl> 506.6981
## $ MAE        <dbl> 376.1961
## $ MPE        <dbl> 0.1110231
## $ MAPE       <dbl> 3.078059
## $ MASE       <dbl> 0.5500734
## $ ACF1       <dbl> -0.03358588

This looks good.

4.7 Predict.

The sw_augument() function helps with model evaluation. We get the “.actual”, “.fitted” and “.resid” columns, which are useful in evaluating the model against the training data. Note that we can pass timetk_idx = TRUE to return the original date index.

# sw_augment - get model residuals
sw_augment(fit_arima, timetk_idx = TRUE)
## # A tibble: 144 × 4
##    index      .actual .fitted .resid
##    <date>       <dbl>   <dbl>  <dbl>
##  1 2010-01-01    6558   6554.  3.79 
##  2 2010-02-01    7481   7479.  2.41 
##  3 2010-03-01    9475   9472.  3.26 
##  4 2010-04-01    9424   9422.  2.39 
##  5 2010-05-01    9351   9349.  1.84 
##  6 2010-06-01   10552  10549.  2.63 
##  7 2010-07-01    9077   9076.  0.882
##  8 2010-08-01    9273   9272.  0.956
##  9 2010-09-01    9420   9419.  0.988
## 10 2010-10-01    9413   9412.  0.882
## # … with 134 more rows

We can visualize the residual diagnostics for the training data to make sure there is no pattern leftover. This looks homoscedastic.

sw_augment(fit_arima, timetk_idx = TRUE) %>%
    ggplot(aes(x = index, y = .resid)) +
    geom_point(size = 5, alpha = 0.5) + 
 # geom_text(aes(label = paste0(index)), 
   #         vjust = -1, color = "black", size = 3.5) +
    geom_hline(yintercept = 0, color = "red") + 
    labs(x = "Date", y = "Residuals") +
    scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
    theme_tq()
Forecast: ARIMA residual diagnosis.

Figure 4.2: Forecast: ARIMA residual diagnosis.

Make a forecast using the forecast() function. This function also delivers some convenient error bounds.

# Forecast next 10 months
fcast_arima <- forecast(fit_arima, h = 10)
fcast_arima
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 2022       12912.37 12213.31 13611.43 11843.25 13981.50
## Feb 2022       13894.25 13151.66 14636.84 12758.55 15029.94
## Mar 2022       16187.49 15441.76 16933.21 15047.00 17327.97
## Apr 2022       16546.23 15773.56 17318.89 15364.54 17727.91
## May 2022       16941.50 16161.88 17721.12 15749.17 18133.83
## Jun 2022       18069.53 17270.24 18868.82 16847.13 19291.93
## Jul 2022       15880.12 15071.41 16688.82 14643.30 17116.93
## Aug 2022       16848.93 16024.12 17673.74 15587.49 18110.37
## Sep 2022       16054.03 15218.51 16889.55 14776.21 17331.85
## Oct 2022       16289.50 15439.79 17139.22 14989.97 17589.03

One problem is the forecast output is not tidy. We need it in a data frame if we want to work with it using the tidyverse functionality. The class is forecast, which is a ts-based-object.

class(fcast_arima)
## [1] "forecast"

We can use sw_sweep() to tidy the forecast output. As an added benefit, if the forecast-object has a timetk index, we can use it to return a date/datetime index as opposed to regular index from the ts-based-object.

First, let’s check if the forecast-object has a timetk index.

# Check if object has timetk index 
has_timetk_idx(fcast_arima)
## [1] TRUE

Great. Now, use sw_sweep() to tidy the forecast output.

fcast_tbl <- sw_sweep(fcast_arima, timetk_idx = TRUE)
tail(fcast_tbl, 12)
## # A tibble: 12 × 7
##    index      key       sales  lo.80  lo.95  hi.80  hi.95
##    <date>     <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 2021-11-01 actual   16896     NA     NA     NA     NA 
##  2 2021-12-01 actual   18139     NA     NA     NA     NA 
##  3 2022-01-01 forecast 12912. 12213. 11843. 13611. 13981.
##  4 2022-02-01 forecast 13894. 13152. 12759. 14637. 15030.
##  5 2022-03-01 forecast 16187. 15442. 15047. 16933. 17328.
##  6 2022-04-01 forecast 16546. 15774. 15365. 17319. 17728.
##  7 2022-05-01 forecast 16941. 16162. 15749. 17721. 18134.
##  8 2022-06-01 forecast 18070. 17270. 16847. 18869. 19292.
##  9 2022-07-01 forecast 15880. 15071. 14643. 16689. 17117.
## 10 2022-08-01 forecast 16849. 16024. 15587. 17674. 18110.
## 11 2022-09-01 forecast 16054. 15219. 14776. 16890. 17332.
## 12 2022-10-01 forecast 16290. 15440. 14990. 17139. 17589.

We can investigate the error on our test set (actuals vs predictions).

# Investigate test error 
error_tbl_arima <- left_join(actuals_tbl, fcast_tbl, 
                             by = c("date" = "index")) %>%
    rename(actual = sales.x, pred = sales.y) %>%
    select(date, actual, pred) %>%
    mutate(error = round((actual - pred), 2), 
           error_pct = round((error / actual), 4)) 
error_tbl_arima
## # A tibble: 10 × 5
##    date       actual   pred  error error_pct
##    <date>      <int>  <dbl>  <dbl>     <dbl>
##  1 2022-01-01  11926 12912. -986.    -0.0827
##  2 2022-02-01  13333 13894. -561.    -0.0421
##  3 2022-03-01  16165 16187.  -22.5   -0.0014
##  4 2022-04-01  15584 16546. -962.    -0.0617
##  5 2022-05-01  16600 16941. -342.    -0.0206
##  6 2022-06-01  17700 18070. -370.    -0.0209
##  7 2022-07-01  15031 15880. -849.    -0.0565
##  8 2022-08-01  16860 16849.   11.1    0.0007
##  9 2022-09-01  16305 16054.  251.     0.0154
## 10 2022-10-01  15418 16290. -872.    -0.0565

And we can calculate a few residuals metrics.

# Calculate test error metrics
test_residuals_arima <- error_tbl_arima$error
test_error_pct_arima <- error_tbl_arima$error_pct * 100 # Percentage error
me   <- mean(test_residuals_arima, na.rm=TRUE)
rmse <- mean(test_residuals_arima^2, na.rm=TRUE)^0.5
mae  <- mean(abs(test_residuals_arima), na.rm=TRUE)
mape <- mean(abs(test_error_pct_arima), na.rm=TRUE)
mpe  <- mean(test_error_pct_arima, na.rm=TRUE)
tibble(me, rmse, mae, mape, mpe) %>% 
  glimpse()
## Rows: 1
## Columns: 5
## $ me   <dbl> -470.195
## $ rmse <dbl> 633.334
## $ mae  <dbl> 522.603
## $ mape <dbl> 3.585
## $ mpe  <dbl> -3.263

Notice that we have the entire forecast in a tibble. We can now more easily visualize the forecast.

fcast_tbl %>%
    filter(index > as.Date("2021-01-01")) %>%
    ggplot(aes(x = index, y = sales, color = key)) +
    # 95% CI
    geom_ribbon(aes(ymin = lo.95, ymax = hi.95), 
                fill = "#D5DBFF", color = NA, size = 0) +
    # 80% CI
    geom_ribbon(aes(ymin = lo.80, ymax = hi.80, fill = key), 
                fill = "#596DD5", color = NA, size = 0, alpha = 0.8) +
    # Prediction
  geom_line() +
geom_point(size = 5, alpha = 0.5) +
    #geom_line(size = 0.5, color = "black") +
    #geom_point(size = 5, alpha = 0.5, color = "black") +
    # Actuals
    geom_line(aes(x = date, y = sales), color = "black", 
              data = actuals_tbl) +
    geom_point(aes(x = date, y = sales), color = "black", size = 5,
               alpha = 0.5, data = actuals_tbl) +
        geom_vline(xintercept = as.numeric(as.Date("2021-12-01")), 
                  linetype = 2) +
    # Aesthetics
labs(x = "Date", y = "Sales",
     caption = c(paste("MAPE=",((mean(abs(test_error_pct_arima))))))) +
    scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
    scale_color_tq() +
    scale_fill_tq() +
    theme_tq()
Forecast: ARIMA.

Figure 4.3: Forecast: ARIMA.

This is a decent forecast.

5 Summary of all results.

An interesting question is: What happens to the accuracy when you average the predictions of all different methods? This question makes sense because the decision of using one technique or another is not trivial. Taking the average could be useful to avoid extreme results but at the same time it could be hard to interpret as the forecast comes from different techniques. In any case, it is interesting to see how it works.

The forecast mean is calculated as:

m <- data.frame(as.data.frame(pred_h2o),as.data.frame(pred_h2o_DL),
          as.data.frame(pred_h2o_all),as.data.frame(predictions_tbl$value),
        as.data.frame(fcast_tbl$sales[(nrow(fcast_tbl)-9):nrow(fcast_tbl)]))
pred_mean <- rowMeans(m)
pred_mean <- as.tibble(pred_mean)

Now let’s see actual versus predicted.

error_tbl_mean <-as_tibble(c(as.data.frame(actuals_tbl),
                        as.data.frame(pred_mean$value))) %>%
    rename(actual = sales, pred = `pred_mean$value`) %>%
    select(date, actual, pred) %>%
    mutate(error = round((actual - pred), 2), 
           error_pct = round((error / actual), 4))
error_tbl_mean
## # A tibble: 10 × 5
##    date       actual   pred  error error_pct
##    <date>      <int>  <dbl>  <dbl>     <dbl>
##  1 2022-01-01  11926 12064. -138.    -0.0116
##  2 2022-02-01  13333 13347.  -14.1   -0.0011
##  3 2022-03-01  16165 15304.  861.     0.0533
##  4 2022-04-01  15584 14428. 1156.     0.0742
##  5 2022-05-01  16600 16016.  584.     0.0352
##  6 2022-06-01  17700 17129.  571.     0.0322
##  7 2022-07-01  15031 14817.  214.     0.0142
##  8 2022-08-01  16860 16281.  579.     0.0344
##  9 2022-09-01  16305 15806.  499.     0.0306
## 10 2022-10-01  15418 15895. -477.    -0.0309

Summarize the individual point forecast errors.

error_tbl_mean %>%
    summarise(me = mean(error), rmse = mean(error^2)^0.5,
        mae  = mean(abs(error)), mape = mean(abs(error_pct)),
        mpe  = mean(error_pct)) %>%
glimpse()
## Rows: 1
## Columns: 5
## $ me   <dbl> 383.566
## $ rmse <dbl> 601.8193
## $ mae  <dbl> 509.356
## $ mape <dbl> 0.03177
## $ mpe  <dbl> 0.02305

Visualize the average forecast.

beer %>%
    filter(date > as.Date("2021-01-01")) %>%
    ggplot(aes(x = date, y = sales)) +
    # Training data
    geom_line(color = "black") +
    geom_point(size = 5, alpha = 0.5) +
    # Predictions
    geom_point(aes(y = pred), size = 5, 
               alpha = 0.5, shape = 21, 
               fill = "red", data = error_tbl) +
geom_line(aes(y = pred), color = "red", size = 0.5, data = error_tbl) +
geom_vline(xintercept = as.numeric(as.Date("2021-12-01")), linetype = 2) +
      # Actuals
    geom_line(color = "black", data = actuals_tbl) +
    geom_point(size =5, alpha = 0.5, data = actuals_tbl) +
geom_vline(xintercept = as.numeric(as.Date("2016-12-01")), linetype = 2) +
    # Aesthetics
    theme_tq() +
    labs(x = "Date", y = "Sales",
caption = c(paste("MAPE=",((100*mean(abs(error_tbl_mean$error_pct)))))))
Forecast: average forecast.

Figure 5.1: Forecast: average forecast.

Not bad, as expected.

Let’s see all results at once: H2O, linear regression, ARIMA and the average forecast.

summary_techniques <- c("H2O without Deep Learning algorithm",
             "H2O including only Deep Learning algorithm",
             "H2O all available algorithms",
             "Multivariate linear regression",
             "ARIMA",
             "Average")
summary_mape <- c(mean(abs(error_tbl$error_pct))*100,
             mean(abs(error_tbl_DL$error_pct))*100,
             mean(abs(error_tbl_all$error_pct))*100,
             mean(abs(test_error_pct_lm)),
             mean(abs(test_error_pct_arima)),
             100*mean(abs(error_tbl_mean$error_pct)))
sum <- data.frame(summary_techniques, summary_mape)
kable(sum, caption = "Summary of results.") %>%
kable_styling(latex_options = "HOLD_position")
Table 5.1: Summary of results.
summary_techniques summary_mape
H2O without Deep Learning algorithm 5.397373
H2O including only Deep Learning algorithm 5.679607
H2O all available algorithms 4.286843
Multivariate linear regression 3.235825
ARIMA 3.585000
Average 3.177000

Nice.

h2o.shutdown(prompt = TRUE) # yes (Y) instead of TRUE?
## Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)?

5.1 Unsorted references.

The main web references of this document are (these are web links):

This document took 24.26 minutes to compile in Rmarkdown, R version 4.2.1.

References.

Dagum, Estela Bee, and Silvia Bianconcini. 2016. Seasonal Adjustment Methods and Real Time Trend-Cycle Estimation. Springer.
Hull, John C. 2020. Machine Learning in Business: An Introduction to the World of Data Science. Amazon Distribution.
Hyndman, Rob J, and George Athanasopoulos. 2021. Forecasting: Principles and Practice. OTexts.
Knuth, Donald Ervin. 1984. “Literate Programming.” The Computer Journal 27 (2): 97–111.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.